{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# GENCODE\n",
    "https://www.gencodegenes.org\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import hail as hl\n",
    "hl.init(spark_conf={\"spark.hadoop.fs.gs.requester.pays.mode\": \"CUSTOM\",\n",
    "                    \"spark.hadoop.fs.gs.requester.pays.project.id\": \"broad-ctsa\",\n",
    "                    \"spark.hadoop.fs.gs.requester.pays.buckets\": \"hail-datasets-tmp,hail-datasets-us,hail-datasets-eu\"})"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Create GENCODE v35 annotation Hail Table:\n",
    "**GENCODE Release 35 (GRCh38.p13):**\n",
    "https://www.gencodegenes.org/human/release_35.html\n",
    "\n",
    "**Comprehensive gene annotation GTF file (on reference chromosomes only):**\n",
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz\n",
    "\n",
    "**To download comprehensive gene annotation GTF file used to create Hail Table:**\n",
    "Run `extract_gencode_v35_annotation_gtf.sh` to download the GTF file to the `hail-datasets-tmp` bucket."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "overwrite = False\n",
    "ht = hl.experimental.import_gtf(\"gs://hail-datasets-tmp/GENCODE/gencode.v35.annotation.gtf.bgz\",\n",
    "                                reference_genome=\"GRCh38\",\n",
    "                                skip_invalid_contigs=True,\n",
    "                                min_partitions=4)\n",
    "\n",
    "# Check all str fields for stray semicolons that remain after GTF import and remove them\n",
    "# e.g. set(ht.transcript_support_level.collect()) == {'3','2','3;','2;','1'}, but should be {1,2,3}\n",
    "fields = list(ht.row_value)\n",
    "str_fields = [f for f in fields if f not in {\"score\", \"frame\"}]\n",
    "ht = ht.annotate(**{f: ht[f].replace(\";\", \"\") for f in str_fields})\n",
    "ht = ht.annotate(**{f: hl.int32(ht[f]) for f in [\"level\", \"exon_number\"]})\n",
    "\n",
    "# Restore original order of table fields after annotating and checkpoint\n",
    "ht = ht.select(*fields)\n",
    "ht = ht.checkpoint(\"gs://hail-datasets-tmp/GENCODE/v35/GRCh38/annotation.ht\",\n",
    "                   overwrite=overwrite,\n",
    "                   _read_if_exists=not overwrite)\n",
    "ht.describe()\n",
    "ht.show()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Read in checkpointed table, write out to hail-datasets GCS buckets\n",
    "ht = hl.read_table(\"gs://hail-datasets-tmp/GENCODE/v35/GRCh38/annotation.ht\")\n",
    "ht.write(\"gs://hail-datasets-us/GENCODE/v35/GRCh38/annotation.ht\")\n",
    "ht.write(\"gs://hail-datasets-eu/GENCODE/v35/GRCh38/annotation.ht\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Read in checkpointed table, write out to hail-datasets S3 bucket\n",
    "# To read/write to S3, need to set authentication properties in spark_conf\n",
    "ht = hl.read_table(\"gs://hail-datasets-tmp/GENCODE/v35/GRCh38/annotation.ht\")\n",
    "ht.write(\"s3a://hail-datasets-us-east-1/GENCODE/v35/GRCh38/annotation.ht\")\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}